Introduction to Open Data Science - Course Project

About the project

I have wanted to gain some knowledge of data analysis for quite a long time. I found this course simply by browsing a list of courses with open registration. It is an amazing opportunity to learn the basics of up-to-date data analysis from experts.

The textbook is very well organized and easy to learn with. However, it is not always easy to digest information about tools you haven't used yet.

My github repository

https://github.com/SergeiRaik/IODS-project

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Mon Nov 28 23:33:47 2022"

Regression and model validation

Introduction

This week I studied regression models. They are extremely important in data analysis because they serve several functions. They help us explain the relationships between variables and form hypotheses about how some parameters affect target variables. Another function is prediction: using a regression model we can predict unknown values of a target variable from a set of explanatory variables. Here we used a dataset derived from a 2014 questionnaire aimed at identifying how various learning strategies (deep, surface and strategic) affect the learning outcome (exam points).

1. Import the dataset from file.

# import necessary packages
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(tidyverse)
## ── Attaching packages ─────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ purrr   0.3.5
## ✔ tidyr   1.2.1     ✔ stringr 1.4.1
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
#import dataset from file and set proper column types
learning2014 <- read_csv("data/learning2014.csv", col_types = cols(.default = "n", age = "i", gender = "f", points = "i"))

2. Dataset structure.

str(learning2014) # structure of the dataset
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.37 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   .default = col_number(),
##   ..   gender = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   age = col_integer(),
##   ..   attitude = col_number(),
##   ..   deep = col_number(),
##   ..   stra = col_number(),
##   ..   surf = col_number(),
##   ..   points = col_integer()
##   .. )
##  - attr(*, "problems")=<externalptr>
dim(learning2014) # dimensions of the dataframe
## [1] 166   7

The dataset consists of 7 variables and 166 observations. The variable gender, a two-level factor, is the respondent's gender. The integers age and points are the respondent's age in years and exam points, respectively. The numerical variables attitude, deep, stra and surf are derived from the respondents' answers and scaled to a 0-5 point scale. More information on the data can be found here. Briefly, they represent three learning strategies (deep, surface and strategic) and the general attitude towards statistics.

3. Overview of the data.

First let's take a look at the summary of each variable.

summary(learning2014) # dataframe summary
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

It can be seen that the variables related to learning strategies and attitude share the same scale (0-5). The number of female participants is about twice that of male participants.

Let's plot the variables against each other and look for correlations visually.

pairs(learning2014[-1]) # plot every variable against each other excluding 1st column (gender)

Due to high variation it is hard to spot correlations visually in these plots. We can instead use the ggpairs() function from the GGally library, which also provides distribution plots for each variable and correlation coefficients.

# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014, mapping = aes(alpha = 0.3, col = gender), lower = list(combo = wrap("facethist", bins = 20)))
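The correlation coefficients that ggpairs() displays can also be computed directly; a quick sketch, assuming learning2014 is loaded as above:

```r
# pairwise Pearson correlations of the numeric variables,
# excluding the first column (gender)
round(cor(learning2014[-1]), digits = 2)
```

The points row is the one to watch when choosing explanatory variables for the regression model.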

Judging by the shape of the distribution graphs, all variables except age have a more or less symmetrical bell shape. To verify that, let's make a QQ plot for each variable. Simply put, a QQ plot compares a distribution with the normal distribution: the graph should be linear if two normally distributed samples are compared.

par(mfrow = c(2,3)) # create a plot matrix
# create a QQ plot for each variable to check if it is normally distributed
qqplot(learning2014$age, rnorm(50))
qqplot(learning2014$attitude, rnorm(50))
qqplot(learning2014$deep, rnorm(50))
qqplot(learning2014$stra, rnorm(50))
qqplot(learning2014$surf, rnorm(50))
qqplot(learning2014$points, rnorm(50))

Indeed, we can see that age is not normally distributed, while the other plots are fairly linear.
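As a side note, base R also offers qqnorm() and qqline(), which compare a sample directly against theoretical normal quantiles instead of a random rnorm() sample; a sketch, assuming learning2014 is loaded as above:

```r
# qqnorm() plots sample quantiles against theoretical normal quantiles,
# avoiding the sampling noise of rnorm(); qqline() adds a reference line
par(mfrow = c(2, 3))
for (v in c("age", "attitude", "deep", "stra", "surf", "points")) {
  qqnorm(learning2014[[v]], main = v)
  qqline(learning2014[[v]])
}
```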

4. Regression model

Let's create a linear regression model where exam points are the target variable and attitude, stra and surf are the explanatory variables, chosen based on their correlation coefficients from the previous section.

# create a multivariable regression model
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
# create a model summary
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

From the p-values we can conclude that attitude has the strongest influence on the target variable, while stra and surf are statistically insignificant. We can therefore simplify the model to a single explanatory variable without losing much predictive power.

# create a simplified model with single variable
my_model2 <- lm(points ~ attitude, data = learning2014)
# create a model summary
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Indeed, multiple R-squared decreased only slightly, from about 0.21 to 0.19. However, neither model is very precise, as they explain only ~20 % of the variability in exam points. From the regression model we can conclude that general attitude towards statistics correlates positively with the exam outcome, although the effect is not very large.
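Since my_model2 is nested in my_model, the difference between them can also be tested formally with an F-test; a sketch, assuming both models from above exist:

```r
# F-test comparing the nested models: does adding stra and surf
# significantly improve the fit over attitude alone?
anova(my_model2, my_model)
```

A large p-value here would confirm that dropping stra and surf loses little explanatory power.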

5. Diagnostic plots

Let’s have a look at the plot of exam points vs attitude.

qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm") # points vs attitude linear plot
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## `geom_smooth()` using formula = 'y ~ x'

Our linear model suggests a linear relationship between exam points and general attitude towards statistics. To check the model assumptions we can make diagnostic plots.

plot(my_model2, which = 1) # residuals vs fitted 

The residuals vs. fitted plot shows the difference between observed and fitted values: if an observed value exactly equals the predicted value, the residual is zero. From our plot we can conclude that the linearity assumption is reasonable, as the residuals are randomly scattered around the zero line. We can also spot values that do not fit the model: in observations 35, 56 and 145, students got unexpectedly low exam points despite their high attitude.
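The flagged observations can also be pulled out programmatically; a sketch, assuming my_model2 exists:

```r
# the most negative residuals should correspond to the
# observations labelled in the residuals vs. fitted plot
res <- residuals(my_model2)
head(sort(res), 3)
# inspect those students' attitude and points
learning2014[order(res)[1:3], c("attitude", "points")]
```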

plot(my_model2, which = 2) # QQ-plot

To check whether the residuals are normally distributed, we made a QQ plot where the quantiles of the residual distribution are plotted against theoretical normal quantiles. In our case the graph is fairly linear, although there is some deviation at the extremes.

plot(my_model2, which = 5) # residuals vs leverage plot

The residuals vs. leverage plot shows that none of the points fall outside Cook's distance, so none of them can be considered influential observations.
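The Cook's distances behind this plot can also be computed directly with cooks.distance(); a sketch, using the common (though informal) 4/n rule of thumb:

```r
# flag observations whose Cook's distance exceeds 4/n,
# a common rule of thumb for "worth a closer look"
cooks <- cooks.distance(my_model2)
which(cooks > 4 / length(cooks))
```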


Logistic regression

Introduction

This week we worked on logistic regression, a method for estimating the relationship between a binary (or categorical) target variable and a set of explanatory variables; it is an extension of linear regression. The dataset we worked with describes student achievement in secondary education at two Portuguese schools. It includes information on family status, school performance, spare time, alcohol consumption, etc. The complete description is published in the UCI Machine Learning Repository.

1. Dataset structure

Let's import the wrangled dataset and take a look at its structure.

# import necessary packages
library(ggplot2)
library(dplyr)
library(GGally)
library(tidyverse)

#import dataset from file and set proper column types
alc <- read_csv("data/alc.csv", show_col_types = FALSE, col_types = cols(.default = "f", age = "i", Medu = "i", Fedu = "i", traveltime = "i", studytime = "i", famrel = "i", freetime = "i", goout = "i", Dalc = "i", Walc = "i", health = "i", failures = "n", absences = "n", G1 = "n", G2 = "n", G3 = "n", alc_use = "n", high_use = "l"))

#check dataset structure, dimensions and column types
glimpse(alc)
## Rows: 370
## Columns: 35
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,…
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M, M,…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, GT3,…
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T,…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services, other, …
## $ Fjob       <fct> teacher, other, other, services, other, other, other, teach…
## $ reason     <fct> course, course, other, home, home, reputation, home, home, …
## $ guardian   <fct> mother, father, mother, mother, father, mother, mother, mot…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, no, …
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes, ye…
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes, no…
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, …
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes, ye…
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, no, ye…
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures   <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no, yes,…
## $ absences   <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1         <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2         <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3         <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…

The dataset consists of 35 variables and 370 observations.

2. Hypotheses about variable relationships

Hypothesis 1 People involved in romantic relationships are less likely to be heavy drinkers.

alc %>% group_by(romantic,high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'romantic'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups:   romantic [2]
##   romantic high_use count
##   <fct>    <lgl>    <int>
## 1 no       FALSE      173
## 2 no       TRUE        78
## 3 yes      FALSE       86
## 4 yes      TRUE        33
g <- ggplot(data = alc, aes(x=high_use))
g+facet_wrap("romantic")+geom_bar()

From both the tabular and the visual summary it can be seen that the fraction of heavy drinkers is roughly the same in both groups (78/251 ≈ 31 % of singles vs. 33/119 ≈ 28 % of those in relationships), so the hypothesis is not supported.

Hypothesis 2 People who go out with friends more often tend to drink more.

# group by frequency of going out and alcohol consumption
alc %>% group_by(goout,high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'goout'. You can override using the
## `.groups` argument.
## # A tibble: 10 × 3
## # Groups:   goout [5]
##    goout high_use count
##    <int> <lgl>    <int>
##  1     1 FALSE       19
##  2     1 TRUE         3
##  3     2 FALSE       82
##  4     2 TRUE        15
##  5     3 FALSE       97
##  6     3 TRUE        23
##  7     4 FALSE       40
##  8     4 TRUE        38
##  9     5 FALSE       21
## 10     5 TRUE        32
g <- ggplot(data = alc, aes(x=high_use))
g+facet_wrap("goout")+geom_bar()

Indeed, it is clearly seen that those who go out with friends more often tend to drink more.
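The trend is even clearer as within-group proportions; a sketch, assuming alc is loaded as above:

```r
# share of heavy drinkers at each level of goout;
# high_use is logical, so mean() gives the proportion of TRUE
alc %>%
  group_by(goout) %>%
  summarise(prop_high_use = mean(high_use))
```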

Hypothesis 3 People who spend more time studying don't have time for drinking.

alc %>% group_by(studytime,high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'studytime'. You can override using the
## `.groups` argument.
## # A tibble: 8 × 3
## # Groups:   studytime [4]
##   studytime high_use count
##       <int> <lgl>    <int>
## 1         1 FALSE       56
## 2         1 TRUE        42
## 3         2 FALSE      128
## 4         2 TRUE        57
## 5         3 FALSE       52
## 6         3 TRUE         8
## 7         4 FALSE       23
## 8         4 TRUE         4

The correlation can be observed here as well. Within the group of students who study less than 2 hours per week, almost half are "heavy drinkers" (42/98 ≈ 43 %), whereas among those who study more than 10 hours the share decreases to ~15 % (4/27).

g <- ggplot(data = alc, aes(x=high_use))
g+facet_wrap("studytime")+geom_bar()

Hypothesis 4 People who consume large amounts of alcohol, have lower grades.

To analyze grades, we first average the three grades and then compare the distributions between heavy and light drinkers.

alc$grades <- rowMeans(alc[, c("G1","G2","G3")])
g <- ggplot(data = alc, aes(x=high_use, y = grades))
g + geom_boxplot()

From the boxplot it is not obvious whether alcohol consumption has an effect on grades. It seems that for low-drinkers the distribution is broader and the median is higher, whereas for high-drinkers the distribution of grades is tilted towards lower values.

alc %>% group_by(high_use) %>% summarise(mean(grades))
## # A tibble: 2 × 2
##   high_use `mean(grades)`
##   <lgl>             <dbl>
## 1 FALSE              11.8
## 2 TRUE               10.9

The mean grades do not differ much. Let's take a look at each distribution separately.

g <- ggplot(data = alc, aes(x=grades))
g + geom_density(alpha=0.2)+facet_grid(~high_use)

Here it seems that for low-drinkers the distribution is bimodal, and the mode with higher grades is absent for high-drinkers. However, we'll see in the regression model whether the mentioned parameters are significant or not.

3. Logistic regression model

To begin with, we create a model with high_use as the target variable and those mentioned in the hypotheses section (grades, studytime, goout, romantic) as explanatory variables.

# create a logistic model and its summary
m <- glm(high_use ~ grades + studytime + goout + romantic, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ grades + studytime + goout + romantic, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7852  -0.7805  -0.5395   0.9537   2.5739  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.43474    0.73913  -1.941  0.05224 .  
## grades      -0.05210    0.04664  -1.117  0.26400    
## studytime   -0.58343    0.16813  -3.470  0.00052 ***
## goout        0.72552    0.11848   6.123 9.16e-10 ***
## romanticyes -0.19076    0.27253  -0.700  0.48396    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 385.84  on 365  degrees of freedom
## AIC: 395.84
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.2381778 0.05452509 0.9971120
## grades      0.9492380 0.86558442 1.0397804
## studytime   0.5579787 0.39679174 0.7685481
## goout       2.0658027 1.64754827 2.6242897
## romanticyes 0.8263314 0.48042098 1.4021977

When the odds ratio is close to 1, changing the explanatory variable does not change the odds of the target variable. Grades and romantic relationships, with ORs of 0.95 and 0.83 respectively, show no clear association with alcohol consumption (their confidence intervals include 1). Meanwhile, each additional point of goout roughly doubles the odds of high alcohol consumption; conversely, each point of studytime roughly halves them.
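To make the odds ratios concrete, the model can be applied to a hypothetical student; the explanatory values below are illustrative, not taken from the data:

```r
# predicted probability of high alcohol use for a hypothetical student
# (illustrative values; romantic must be one of the factor levels)
new_student <- data.frame(grades = 11, studytime = 2,
                          goout = 4, romantic = "no")
predict(m, newdata = new_student, type = "response")
```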

4. Predictive power of the model

Let's use our model to predict the level of alcohol consumption by calculating the probability of high_use with the predict() function.

#calculate the probabilities of high_use using logistic model
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)

#prediction is true if probability > 0.5
alc <- mutate(alc, prediction = probability>0.5) 
alc %>% group_by(high_use,prediction) %>% summarise(count = n())
## `summarise()` has grouped output by 'high_use'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups:   high_use [2]
##   high_use prediction count
##   <lgl>    <lgl>      <int>
## 1 FALSE    FALSE        238
## 2 FALSE    TRUE          21
## 3 TRUE     FALSE         70
## 4 TRUE     TRUE          41

From the summary table we can conclude that when the model predicts a low level of alcohol consumption (FALSE), it is correct in 238 / (238 + 70) ≈ 77 % of cases; when it predicts high use, in 41 / (41 + 21) ≈ 66 %. Overall the model is wrong in (21 + 70) / 370 ≈ 25 % of cases. Compared with a simple guessing strategy (probability 0.5, i.e. ~50 % wrong), the model makes about half as many wrong predictions, which is not bad considering the small number of variables used.
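The same error rate can be computed with a simple loss function instead of by hand; a sketch, assuming the probability column created above:

```r
# mean prediction error: share of individuals whose predicted
# probability falls on the wrong side of 0.5
loss_func <- function(class, prob) {
  mean(abs(class - prob) > 0.5)
}
loss_func(class = alc$high_use, prob = alc$probability)
```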

Clustering and classification

Introduction

The dataset we are exploring today contains information about different suburban areas near Boston. It includes information about the environmental conditions of the towns, population, characteristics of housing and transportation, crime rate, etc. The full description can be found here.

1. Loading the dataset

Let's load the necessary libraries and the Boston dataset.

# access the MASS package
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(tidyverse)
# load the data
data("Boston")
#explore structure and dimensions 
glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

The dataset consists of 14 variables and 506 observations.

2. Visualization of variables and correlations

When analyzing a dataset it is always useful to look at the variable distributions. To do that, we first convert the data from wide to long format (variable-value pairs) using the melt() function from the reshape library. Then we visualize the distributions using the ggplot2 library.

library(ggplot2)
library(reshape)
## 
## Attaching package: 'reshape'
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths
## The following object is masked from 'package:dplyr':
## 
##     rename
melt.boston <- melt(Boston)
## Using  as id variables
ggplot(data = melt.boston, aes(x = value)) + 
  stat_density() + 
  facet_wrap(~variable, scales = "free")

library(MASS)
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) 

# print the correlation matrix
cor_matrix %>% 
  round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

3. Scaling the dataset

To scale the dataset we subtract each variable's mean and divide by its standard deviation (SD).

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

As a result of scaling, each variable is transformed so that its mean is zero, and an observation that deviates from the mean by one SD gets the value ±1.
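The transformation can be verified by hand for a single column; a small sketch comparing manual standardization with the scale() output:

```r
# manual standardization of crim should match the scale() result
crim_manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(crim_manual, boston_scaled$crim)
```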

4. Turning crime rate into categorical variable

To build a model for predicting the crime rate, we first have to turn it into a categorical variable. To do that we create a quantile vector with the quantile() function.

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610

Then we break the crim variable into a four-level factor and substitute it for the original variable in the scaled dataset.

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

5. LDA model

First we split the dataset into train (80 %) and test (20 %) samples.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2400990 0.2549505 0.2549505 0.2500000 
## 
## Group means:
##                  zn      indus         chas        nox         rm        age
## low       1.0992139 -0.9180158 -0.109974419 -0.8697067  0.5347381 -0.8816403
## med_low  -0.1055088 -0.2798523 -0.004759149 -0.5385400 -0.1658617 -0.3028679
## med_high -0.3781741  0.2428217  0.262810770  0.3971722  0.0481647  0.4185395
## high     -0.4872402  1.0171306 -0.155385496  1.0638394 -0.3635076  0.7921939
##                 dis        rad        tax    ptratio      black       lstat
## low       0.7908651 -0.6846906 -0.7373948 -0.5408901  0.3739067 -0.79587223
## med_low   0.3524172 -0.5436697 -0.4405372 -0.0700480  0.3488275 -0.08236995
## med_high -0.4306980 -0.3953725 -0.2752081 -0.3140058  0.1392144  0.04280521
## high     -0.8337660  1.6379981  1.5139626  0.7806252 -0.7684682  0.91251737
##                medv
## low       0.6237035
## med_low  -0.0358693
## med_high  0.1532996
## high     -0.7646762
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.09706622  0.883334496 -0.94242333
## indus    0.02908800 -0.299005341  0.03801384
## chas    -0.10357188 -0.132158731  0.02450388
## nox      0.41827024 -0.553512503 -1.25265429
## rm      -0.07171519  0.017501315 -0.09048660
## age      0.25776905 -0.291202406 -0.08295926
## dis     -0.05671087 -0.386126597  0.48550443
## rad      3.03325833  0.902362544 -0.10471743
## tax     -0.08175475 -0.123037846  0.80500206
## ptratio  0.09860672  0.079471045 -0.24248560
## black   -0.15891680 -0.008499813  0.15863175
## lstat    0.20384612 -0.201588997  0.37737472
## medv     0.13788742 -0.377841976 -0.26380794
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9369 0.0450 0.0181
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

From the LDA plot we can clearly see that observations with high crime rate are very well separated. Others are overlapping which will clearly affect the prediction power of the model.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12      17        1    0
##   med_low    4      16        3    0
##   med_high   0      10       13    0
##   high       0       0        0   26

Indeed, as you can see, the best results are obtained for high cluster, whereas for others there are a lot of false predictions.
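The overall accuracy behind this cross-tabulation is simply the share of the diagonal; a sketch, assuming the objects from above:

```r
# proportion of correctly classified test observations
mean(correct_classes == lda.pred$class)
```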

6. K-means clustering

Finally, let's try k-means clustering. We reload the Boston dataset, scale it again, calculate the Euclidean distance matrix and run k-means with an initial guess of three clusters.

data('Boston')
bst <- scale(Boston) 
bst <- as.data.frame(bst) 
dist_eu <- dist(bst) # Euclidean distance matrix
km <- kmeans(x = bst, centers = 3)

Then we determine the optimal number of clusters by calculating the total within-cluster sum of squares (WCSS) for different cluster counts.

set.seed(21) 
#set the max clusters numbers
k_max <- 10 

#calculate the total sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(bst, k)$tot.withinss}) 
library(ggplot2)
qplot(x = 1:k_max, y = twcss, geom = 'line') 

The total WCSS drops dramatically up to 2 clusters, which suggests that 2 is the optimal number of clusters.

Let's run k-means again with two centers and plot the result:

km <-kmeans(bst, centers = 2)
pairs(bst, col = km$cluster)

Super bonus 3D plots

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

First we set the color to be the crime classes of the train set. 

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:reshape':
## 
##     rename
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)

Second, we color the points by the clusters found by k-means.

km1 <- dplyr::select(train, -crime)
km1 <- kmeans(km1, centers = 4)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km1$cluster)

The clusterings are more or less similar; however, observations with higher crime rates are better separated when coloring by crime class. (More chapters will be added as we proceed with the course!)